Multi-functional execution lane for image processor

ABSTRACT

An apparatus is described that includes an execution unit having a multiply add computation unit, a first ALU logic unit and a second ALU logic unit. The ALU unit is to perform first, second, third and fourth instructions. The first instruction is a multiply add instruction. The second instruction is to perform parallel ALU operations with the first and second ALU logic units operating simultaneously to produce different respective output resultants of the second instruction. The third instruction is to perform sequential ALU operations with one of the ALU logic units operating from an output of the other of the ALU logic units to determine an output resultant of the third instruction. The fourth instruction is to perform an iterative divide operation in which the first ALU logic unit and the second ALU logic unit operate to determine first and second division resultant digit values.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of and claims priority to U.S. patent application Ser. No. 15/591,955, filed on May 10, 2017, which is a continuation of U.S. patent application Ser. No. 14/960,334, filed on Dec. 4, 2015 (now U.S. Pat. No. 9,830,150). The disclosures of the prior applications are considered part of and are incorporated by reference in the disclosure of this application.

FIELD OF INVENTION

The field of invention pertains generally to the computing sciences, and, more specifically, to a multi-functional execution lane for an image processor.

BACKGROUND

Image processing typically involves the processing of pixel values that are organized into an array. Here, a spatially organized two dimensional array captures the two dimensional nature of images (additional dimensions may include time (e.g., a sequence of two dimensional images) and data type (e.g., colors)). In a typical scenario, the arrayed pixel values are provided by a camera that has generated a still image or a sequence of frames to capture images of motion. Traditional image processors typically fall on either side of two extremes.

A first extreme performs image processing tasks as software programs executing on a general purpose processor or general purpose-like processor (e.g., a general purpose processor with vector instruction enhancements). Although the first extreme typically provides a highly versatile application software development platform, its use of finer grained data structures combined with the associated overhead (e.g., instruction fetch and decode, handling of on-chip and off-chip data, speculative execution) ultimately results in larger amounts of energy being consumed per unit of data during execution of the program code.

A second, opposite extreme applies fixed function hardwired circuitry to much larger blocks of data. The use of larger (as opposed to finer grained) blocks of data applied directly to custom designed circuits greatly reduces power consumption per unit of data. However, the use of custom designed fixed function circuitry generally results in a limited set of tasks that the processor is able to perform. As such, the widely versatile programming environment (that is associated with the first extreme) is lacking in the second extreme.

A technology platform that provides for both highly versatile application software development opportunities combined with improved power efficiency per unit of data remains a desirable yet missing solution.

SUMMARY

An apparatus is described that includes an execution unit having a multiply add computation unit, a first ALU logic unit and a second ALU logic unit. The ALU unit is to perform first, second, third and fourth instructions. The first instruction is a multiply add instruction. The second instruction is to perform parallel ALU operations with the first and second ALU logic units operating simultaneously to produce different respective output resultants of the second instruction. The third instruction is to perform sequential ALU operations with one of the ALU logic units operating from an output of the other of the ALU logic units to determine an output resultant of the third instruction. The fourth instruction is to perform an iterative divide operation in which the first ALU logic unit and the second ALU logic unit alternately operate during an iteration to determine a quotient digit value.

An apparatus is described comprising an execution unit of an image processor. The ALU unit comprises means for executing a first instruction, the first instruction being a multiply add instruction. The ALU unit comprises means for executing a second instruction including performing parallel ALU operations with first and second ALU logic units operating simultaneously to produce different respective output resultants of the second instruction. The ALU unit comprises means for executing a third instruction including performing sequential ALU operations with one of the ALU logic units operating from an output of the other of the ALU logic units to determine an output resultant of the third instruction. The ALU unit comprises means for executing a fourth instruction including performing an iterative divide operation in which the first ALU logic unit and the second ALU logic unit operate to determine first and second division resultant digit values.

FIGURES

The following description and accompanying drawings are used to illustrate embodiments of the invention. In the drawings:

FIG. 1 shows a stencil processor component of an image processor;

FIG. 2 shows an instance of an execution lane and its coupling to a two dimensional shift register;

FIG. 3 shows relative delay of functions performed by an embodiment of the execution lane of FIG. 2;

FIG. 4 shows a design for a multi-functional execution lane;

FIGS. 5a and 5b show circuitry and a methodology to perform an iterative divide operation;

FIG. 6 shows a methodology performed by the execution lane described with respect to FIGS. 3 through 5a and 5b;

FIG. 7 shows an embodiment of a computing system.

DETAILED DESCRIPTION

FIG. 1 shows an embodiment of a stencil processor architecture 100. A stencil processor, as will be made more clear from the following discussion, is a processor that is optimized or otherwise designed to process stencils of image data. One or more stencil processors may be integrated into an image processor that performs stencil based tasks on images processed by the processor. As observed in FIG. 1, the stencil processor includes a data computation unit 101, a scalar processor 102 and associated memory 103 and an I/O unit 104. The data computation unit 101 includes an array of execution lanes 105, a two-dimensional shift array structure 106 and separate random access memories 107 associated with specific rows or columns of the array.

The I/O unit 104 is responsible for loading input “sheets” of image data received into the data computation unit 101 and storing output sheets of data from the data computation unit externally from the stencil processor. In an embodiment, the loading of sheet data into the data computation unit 101 entails parsing a received sheet into rows/columns of image data and loading the rows/columns of image data into the two dimensional shift register structure 106 or respective random access memories 107 of the rows/columns of the execution lane array (described in more detail below).

If the sheet is initially loaded into memories 107, the individual execution lanes within the execution lane array 105 may then load sheet data into the two-dimensional shift register structure 106 from the random access memories 107 when appropriate (e.g., as a load instruction just prior to operation on the sheet's data). Upon completion of the loading of a sheet of data into the register structure 106 (whether directly from a sheet generator or from memories 107), the execution lanes of the execution lane array 105 operate on the data and eventually “write back” finished data externally from the stencil processor, or, into the random access memories 107. If the latter, the I/O unit 104 fetches the data from the random access memories 107 to form an output sheet which is then written externally from the stencil processor.

The scalar processor 102 includes a program controller 109 that reads the instructions of the stencil processor's program code from instruction memory 103 and issues the instructions to the execution lanes in the execution lane array 105. In an embodiment, a single same instruction is broadcast to all execution lanes within the array 105 to effect a SIMD-like behavior from the data computation unit 101. In an embodiment, the instruction format of the instructions read from scalar memory 103 and issued to the execution lanes of the execution lane array 105 includes a very-long-instruction-word (VLIW) type format that includes more than one opcode per instruction. In a further embodiment, the VLIW format includes both an ALU opcode that directs a mathematical function performed by each execution lane's ALU and a memory opcode (that directs a memory operation for a specific execution lane or set of execution lanes).

The term “execution lane” refers to a set of one or more execution units capable of executing an instruction (e.g., logic circuitry that can execute an instruction). An execution lane can, in various embodiments, include more processor-like functionality beyond just execution units, however. For example, besides one or more execution units, an execution lane may also include logic circuitry that decodes a received instruction, or, in the case of more MIMD-like designs, logic circuitry that fetches and decodes an instruction. With respect to MIMD-like approaches, although a centralized program control approach has largely been described herein, a more distributed approach may be implemented in various alternative embodiments (e.g., including program code and a program controller within each execution lane of the array 105).

The combination of an execution lane array 105, program controller 109 and two dimensional shift register structure 106 provides a widely adaptable/configurable hardware platform for a broad range of programmable functions. For example, application software developers are able to program kernels having a wide range of different functional capability as well as dimension (e.g., stencil size) given that the individual execution lanes are able to perform a wide variety of functions and are able to readily access input image data proximate to any output array location.

During operation, because of the execution lane array 105 and two-dimensional shift register 106, multiple stencils of an image can be operated on in parallel (as is understood in the art, a stencil is typically implemented as a contiguous N×M or N×M×C group of pixels within an image (where N can equal M)). Here, e.g., each execution lane executes operations to perform the processing for a particular stencil worth of data within the image data, while the two dimensional shift array shifts its data to sequentially pass the data of each stencil to register space coupled to the execution lane that is executing the tasks for the stencil. Note that the two-dimensional shift register 106 may also be of larger dimension than the execution lane array 105 (e.g., if the execution lane array is of dimension X×X, the two dimensional shift register 106 may be of dimension Y×Y where Y>X). Here, in order to fully process stencils, when the left edge of the stencils is being processed by the execution lanes, the data in the shift register 106 will “push out” off the right edge of the execution lane array 105. The extra dimension of the shift register 106 is able to absorb the data that is pushed off the edge of the execution lane array.

Apart from acting as a data store for image data being operated on by the execution lane array 105, the random access memories 107 may also keep one or more look-up tables. In various embodiments one or more scalar look-up tables may also be instantiated within the scalar memory 103.

A scalar look-up involves passing the same data value from the same look-up table from the same index to each of the execution lanes within the execution lane array 105. In various embodiments, the VLIW instruction format described above is expanded to also include a scalar opcode that directs a look-up operation performed by the scalar processor into a scalar look-up table. The index that is specified for use with the opcode may be an immediate operand or fetched from some other data storage location. Regardless, in an embodiment, a look-up from a scalar look-up table within scalar memory essentially involves broadcasting the same data value to all execution lanes within the execution lane array 105 during the same clock cycle.

FIG. 2 shows another, more detailed depiction of the unit cell for an ALU execution unit 205 within an execution lane 201 and corresponding local shift register structure. The execution lane and the register space associated with each location in the execution lane array is, in an embodiment, implemented by instantiating the circuitry observed in FIG. 2 at each node of the execution lane array. As observed in FIG. 2, the unit cell includes an execution lane 201 coupled to a register file 202 consisting of four registers R1 through R4. During any cycle, the ALU execution unit may read from any of registers R1 through R4 and write to any of registers R1 through R4.

In an embodiment, the two dimensional shift register structure is implemented by permitting, during a single cycle, the contents of any of (only) one of registers R1 through R3 to be shifted “out” to one of its neighbor's register files through output multiplexer 203, and, having the contents of any of (only) one of registers R1 through R3 replaced with content that is shifted “in” from a corresponding one of its neighbors through input multiplexers 204 such that shifts between neighbors are in a same direction (e.g., all execution lanes shift left, all execution lanes shift right, etc.). In various embodiments, the execution lanes themselves execute their own respective shift instruction to effect a large scale SIMD two-dimensional shift of the shift register's contents. Although it may be common for a same register to have its contents shifted out and replaced with content that is shifted in on a same cycle, the multiplexer arrangement 203, 204 permits different shift source and shift target registers within a same register file during a same cycle.

As depicted in FIG. 2, note that during a shift sequence an execution lane will shift content out from its register file 202 to each of its left, right, top and bottom neighbors. In conjunction with the same shift sequence, the execution lane will also shift content into its register file from a particular one of its left, right, top and bottom neighbors. Again, the shift out target and shift in source should be consistent with a same shift direction for all execution lanes (e.g., if the shift out is to the right neighbor, the shift in should be from the left neighbor).
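The same-direction constraint described above can be illustrated with a minimal software sketch. It models one register plane of the shift register as a list of rows, which is an assumption made for illustration only; the function name `shift_plane`, the direction strings and the `fill` value supplied at the array edge are hypothetical and do not come from the embodiments.

```python
def shift_plane(plane, direction, fill=0):
    """Shift one register plane by one position, all lanes in the same direction.

    `plane` is a list of equal-length rows, one entry per execution lane.
    `fill` is a hypothetical edge value; the text does not specify one.
    """
    rows, cols = len(plane), len(plane[0])
    # Row/column offset of the neighbor that receives each lane's content.
    d_row, d_col = {
        "left": (0, -1),
        "right": (0, 1),
        "up": (-1, 0),
        "down": (1, 0),
    }[direction]
    shifted = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + d_row, c + d_col
            if 0 <= nr < rows and 0 <= nc < cols:
                # Shift out to the neighbor; that neighbor shifts in from here.
                shifted[nr][nc] = plane[r][c]
    return shifted
```

In this sketch a shift to the right means every lane's value moves to its right neighbor, so each lane receives the value of its left neighbor, matching the example given in the text.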

Although in one embodiment the content of only one register is permitted to be shifted per execution lane per cycle, other embodiments may permit the content of more than one register to be shifted in/out. For example, the content of two registers may be shifted out/in during a same cycle if a second instance of the multiplexer circuitry 203, 204 observed in FIG. 2 is incorporated into the design of FIG. 2. Of course, in embodiments where the content of only one register is permitted to be shifted per cycle, shifts from multiple registers may take place between mathematical operations by consuming more clock cycles for shifts between mathematical operations (e.g., the contents of two registers may be shifted between math ops by consuming two shift ops between the math ops).

If less than all the content of an execution lane's register files is shifted out during a shift sequence, note that the content of the non shifted out registers of each execution lane remains in place (does not shift). As such, any non shifted content that is not replaced with shifted in content persists local to the execution lane across the shifting cycle. A memory execution unit, not shown in FIG. 2 for illustrative ease, may also exist in each execution lane 201 to load/store data from/to the random access memory space that is associated with the execution lane's row and/or column within the execution lane array. Here, the memory unit acts as a standard M unit in that it is often used to load/store data that cannot be loaded/stored from/to the execution lane's own register space. In various embodiments, the primary operation of the M unit is to write data from a local register into memory, and, read data from memory and write it into a local register.

With respect to the ISA opcodes supported by the ALU unit 205 of the hardware execution lane 201, in various embodiments, the mathematical opcodes supported by the ALU unit 205 may include any of the following ALU operations: add (ADD), subtract (SUB), move (MOV), multiply (MUL), multiply-add (MAD), absolute value (ABS), divide (DIV), shift-left (SHL), shift-right (SHR), return min or max (MIN/MAX), select (SEL), logical AND (AND), logical OR (OR), logical XOR (XOR), count leading zeroes (CLZ or LZC) and a logical complement (NOT). An embodiment of an ALU unit 205 or portion thereof is described in more detail below with respect to FIGS. 3 through 5. As described just above, memory access instructions can be executed by the execution lane 201 to fetch/store data from/to their associated random access memory. Additionally the hardware execution lane 201 supports shift op instructions (right, left, up, down) to shift data within the two dimensional shift register structure. As described above, program control instructions are largely executed by the scalar processor of the stencil processor.

FIG. 3 shows a consumption time map for an execution unit, or portion thereof, of an execution lane as described just above. Specifically, FIG. 3 maps out the amount of time consumed by each of a number of different instructions that can be executed by the execution unit. As observed in FIG. 3, the execution unit can perform: 1) a multiply-add instruction (MAD) 301; 2) two full width (FW) or four half width (HW) ALU operations in parallel 302; 3) a double width (2×W) ALU operation 303; 4) a FUSED operation of the form ((C op D) op B) 304; and, 5) an iterative divide (DIV) operation 306.

As observed in FIG. 3, the MAD operation 301, by itself, consumes the most time amongst the various instructions that the execution unit can execute. As such, a design perspective is that the execution unit can be enhanced with multiple ALU logic units, besides the logic that performs the MAD operation, to perform, e.g., multiple ALU operations in parallel (such as operation 302) and/or multiple ALU operations in series (such as operation 304).

FIG. 4 shows an embodiment of a design for an execution unit 405 that can support the different instructions illustrated in FIG. 3. As observed in FIG. 4, the execution unit 405 includes a first ALU logic unit 401 and a second ALU logic unit 402 as well as a multiply-add logic unit 403. Inputs from the register file are labeled A, B, C, D while outputs written back to the register file are labeled X and Y. As such, the execution unit 405 is a 4 input port, 2 output port execution unit.

The multiply add logic unit 403, in an embodiment, performs a full multiply-add instruction. That is, the multiply-add logic unit 403 performs the function (A*B)+(C,D) where A is a full width input operand, B is a full width operand and (C,D) is a concatenation of two full width operands to form a double width summation term. For example, if full width corresponds to 16 bits, A is 16 bits, B is 16 bits and the summation term is 32 bits. As is understood in the art, a multiply add of two full width values can produce a double width resultant. As such, the resultant of the MAD operation is written across the X, Y output ports where, e.g., X includes the top half of the resultant and Y includes the bottom half of the resultant. In a further embodiment the multiply-add unit 403 supports a half width multiply add. Here, e.g., the lower half of A is used as a first multiplicand, the lower half of B is used as a second multiplicand and either C or D (but not a concatenation) is used as the addend.
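The full width multiply-add can be illustrated with a short sketch for the 16 bit example. Several details here are assumptions rather than statements of the embodiment: operands are treated as unsigned, C is taken as the upper half of the concatenated addend (C,D), and a sum that exceeds 32 bits simply wraps. The function name `mad` is hypothetical.

```python
WIDTH = 16                       # full width, per the 16 bit example in the text
MASK = (1 << WIDTH) - 1          # 0xFFFF
DOUBLE_MASK = (1 << (2 * WIDTH)) - 1


def mad(a, b, c, d):
    """Model (A*B)+(C,D) and return the (X, Y) output pair.

    Assumptions: unsigned operands, C is the upper half of the addend,
    and any overflow beyond 32 bits wraps.
    """
    addend = ((c & MASK) << WIDTH) | (d & MASK)      # concatenation (C,D)
    total = ((a & MASK) * (b & MASK) + addend) & DOUBLE_MASK
    x = total >> WIDTH                               # top half of the resultant
    y = total & MASK                                 # bottom half of the resultant
    return x, y
```

For instance, `mad(0xFFFF, 0xFFFF, 0, 0)` yields the double width product 0xFFFE0001 split as X = 0xFFFE and Y = 0x0001.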

As mentioned above with respect to FIG. 3, the execution of the MAD operation may consume more time than a typical ALU logic unit. As such, the execution unit includes a pair of ALU logic units 401, 402 to provide not only for parallel execution of ALU operations but sequential ALU operations as well.

Here, referring to FIGS. 3 and 4, with respect to the dual parallel FW operation 302, the first ALU logic unit 401 performs the first full-width ALU operation (A op B) while the second ALU performs the second full-width ALU operation (C op D) in parallel with the first. Again, in an embodiment, full width operation corresponds to 16 bits. Here, the first ALU logic unit 401 writes the resultant of (A op B) into register X while the second ALU logic unit 402 writes the resultant of (C op D) into register Y.

In an embodiment, the instruction format for executing the dual parallel full width ALU operation 302 includes an opcode that specifies dual parallel full width operation and the destination registers. In a further embodiment, the opcode, besides specifying dual parallel full width operation, also specifies one or two ALU operations. If the opcode only specifies one operation, both ALU logic units 401, 402 will perform the same operation. By contrast if the opcode specifies first and second different ALU operations, the first ALU logic unit 401 performs one of the operations and the second ALU logic unit 402 performs the second of the operations.

With respect to the half width (HW) feature of operation 302, four half width ALU operations are performed in parallel. Here, each of inputs A, B, C and D are understood to each include two separate input operands. That is, e.g., a top half of A corresponds to a first input operand, a lower half of A corresponds to a second input operand, a top half of B corresponds to a third input operand, a lower half of B corresponds to a fourth input operand, etc.

As such, ALU logic unit 401 handles two ALU operations in parallel and ALU logic unit 402 handles two ALU operations in parallel. Thus, during execution, all four half width operations are performed in parallel. At the end of the operation 302, ALU logic unit 401 writes two half width resultants into register X and ALU logic unit 402 writes two half width resultants into register Y. As such, there are four separate half width resultants in registers X and Y.
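The dual full width and quad half width forms of operation 302 can be sketched as follows for the 16 bit example. The function names `dual_fw` and `quad_hw` are hypothetical, operations are passed as ordinary Python callables, and the sketch assumes that each half width operation pairs the top half of one input with the top half of the other (and likewise for the lower halves), which the text does not state explicitly.

```python
import operator

WIDTH = 16
HALF = WIDTH // 2
FULL_MASK = (1 << WIDTH) - 1     # 0xFFFF
HALF_MASK = (1 << HALF) - 1      # 0xFF


def dual_fw(op_x, op_y, a, b, c, d):
    """Two full width operations: X = A op_x B and Y = C op_y D."""
    return op_x(a, b) & FULL_MASK, op_y(c, d) & FULL_MASK


def _hw_pair(op, p, q):
    """Apply `op` independently to the top halves and the lower halves."""
    top = op((p >> HALF) & HALF_MASK, (q >> HALF) & HALF_MASK) & HALF_MASK
    low = op(p & HALF_MASK, q & HALF_MASK) & HALF_MASK
    return (top << HALF) | low   # two half width resultants packed together


def quad_hw(op, a, b, c, d):
    """Four half width operations: two packed into X, two packed into Y."""
    return _hw_pair(op, a, b), _hw_pair(op, c, d)
```

Because each half is masked separately, a carry out of a lower half does not propagate into the top half; for example, `quad_hw(operator.add, 0x01FF, 0x0101, 0x1020, 0x0304)` gives X = 0x0200 and Y = 0x1324.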

In an embodiment, the instruction format not only specifies that parallel half width operation is to be performed but also specifies which ALU operation(s) is/are to be performed. In various embodiments the instruction format may specify that all four operations are the same and only specify one operation and/or may specify that all four operations are different and specify four different operations. In the case of the latter, alternatively, to effect same operations for all four operations the instruction format may specify the same operation four times. Various combinations of these instruction format approaches are also possible.

With respect to the double wide ALU operation 303 of FIG. 3, in an embodiment, the execution unit 405 performs the operation (A,C) op (B,D) where (A,C) is a concatenation of inputs A and C that form a first double wide input operand and (B,D) is a concatenation of inputs B and D that form a second double wide input operand. Here, a carry term may be passed along carry line 404 from the second ALU logic unit 402 to the first ALU logic unit 401 to carry operations forward from full width to double width.

That is, in an embodiment, the C and D terms represent the lower ordered halves of the two double wide input operands. The second ALU logic unit 402 performs the specified operation (e.g., ADD) on the two lower halves and the resultant that is generated corresponds to the lower half of the overall double wide resultant. As such, the resultant from the second ALU logic unit 402 is written into register Y. The operation on the lower halves may generate a carry term that is carried to the first ALU logic unit 401 which continues the operation on the two respective upper halves A and B of the input operands. The resultant from the first ALU logic unit 401 corresponds to the upper half of the overall resultant which is written into output register X. Because operation on the upper halves by the first ALU logic unit 401 may not be able to start until it receives the carry term from the second ALU logic unit 402, the operation of the ALU logic units 402, 401 is sequential rather than parallel. As such, as observed in FIG. 3, double width operations 303 may take approximately twice as long as parallel full/half width operations 302.
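The sequential, carry-linked behavior of the double wide operation can be sketched for the ADD case with 16 bit full width operands. The function name `double_wide_add` is hypothetical, operands are treated as unsigned, and a carry out of the upper half is simply dropped, which is an assumption.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def double_wide_add(a, b, c, d):
    """Model (A,C) + (B,D) and return the (X, Y) output pair."""
    # Second ALU logic unit: lower halves C and D, resultant goes to Y.
    low_sum = (c & MASK) + (d & MASK)
    y = low_sum & MASK
    carry = low_sum >> WIDTH                 # carry term: 0 or 1
    # First ALU logic unit: upper halves plus the carry, resultant goes to X.
    # It cannot start until the carry is known, hence the sequential timing.
    x = ((a & MASK) + (b & MASK) + carry) & MASK
    return x, y
```

As a check, `double_wide_add(0x0001, 0x0002, 0xFFFF, 0x0001)` models 0x0001FFFF + 0x00020001 and returns X = 0x0004, Y = 0x0000.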

Nevertheless, because the MAD operation 301 can consume more time than two consecutive ALU logic unit operations, the machine can be built around an execution unit 405 that can attempt to insert as much function as it can into the time period consumed by its longest propagation delay operation. As such, in an embodiment, the cycle time of the execution unit 405 corresponds to the execution time of the MAD instruction 301. In an embodiment, the instruction format for a double wide operation specifies not only the operation to be performed, but also, that the operation is a double wide operation.

With respect to the FUSED operation 304, the execution unit 405 performs the operation (C op D) op B. Here, like the double wide ALU operation 303 discussed just above, the dual ALU logic units 401, 402 operate sequentially because the second operation operates on the resultant of the first operation. Here, the second ALU logic unit 402 performs the initial operation on full width inputs C and D. The resultant of the second ALU logic unit 402, instead of being written into resultant register space, is instead multiplexed into an input of the first ALU logic unit 401 via multiplexer 406. The first ALU logic unit 401 then performs the second operation and writes the resultant into register X.
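The full width FUSED data path can be sketched as two chained operations. The function name `fused` is hypothetical, operations are passed as Python callables, and the intermediate resultant is assumed to be truncated to full width before it is fed to the second operation.

```python
import operator

WIDTH = 16
MASK = (1 << WIDTH) - 1


def fused(op_first, op_second, b, c, d):
    """Model (C op_first D) op_second B and return the value written to X."""
    # Second ALU logic unit: initial operation on full width inputs C and D.
    intermediate = op_first(c & MASK, d & MASK) & MASK
    # Multiplexer routes the intermediate into the first ALU logic unit,
    # which applies the second operation with B.
    return op_second(intermediate, b & MASK) & MASK
```

For example, `fused(operator.add, operator.sub, 2, 5, 4)` computes (5 + 4) - 2 and returns 7.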

In a further embodiment, a half width FUSED operation can also be performed. Here, operation is as described above except that only half of the input operands are utilized. That is, for example, in calculating (C op D) op B, only the lower half of C and the lower half of D are used to determine a half width result for the first operation, then, only the lower half of B is used along with the half width resultant of the first operation to perform the second operation. The resultant is written as a half width value in register X. Further still, two half width FUSED operations can be performed in parallel. Here, operation is as described just above simultaneously with the same logical operations but for the high half of the operands. The result is two half width values written into register X.

In an embodiment, the instruction format for a FUSED operation specifies that a FUSED operation is to be performed and specifies the two operations. If the same operation is to be performed twice, in an embodiment, the instruction only specifies the operation once or specifies it twice. In a further embodiment, apart from specifying FUSED operation and the operation(s) to be performed, the instruction format may further specify whether full width or half width operation is to be performed.

Operation 306 of FIG. 3 illustrates that an iterative divide operation can also be performed by the execution unit. In particular, as explained in more detail below, in various embodiments both ALU logic units 401, 402 collaboratively participate in parallel during the iterative divide operation.

FIGS. 5a and 5b pertain to an embodiment for executing the iterative divide instruction 306 of FIG. 3. FIG. 5a shows additional circuitry to be added to the execution unit circuitry 405 of FIG. 4 to enable execution of the iterative divide instruction (with the exception of the ALU logic units 501, 502 which are understood to be the same ALU logic units 401, 402 of FIG. 4). FIG. 5b shows an embodiment of the micro-sequence operation of the execution unit during execution of the iterative divide instruction. As will be more clear from the following discussion, a single execution of the instruction essentially performs two iterations that are akin to the atomic act of long division in which an attempt is made to divide the leading digit(s) of a numerator (the value being divided into) by a divisor (the value being divided into the numerator).

For simplicity, 16 bit division will be described (those of ordinary skill will be able to extend the present teachings to different width embodiments). With the embodiment described herein performing two long division atomic acts, eight sequential executions of the instruction are used to fully divide a 16 bit numerator by a 16 bit divisor. That is, each atomic long division act corresponds to the processing of a next significant bit of the numerator. Two such significant bits are processed during a single execution of the instruction. Therefore, in order to process all bits of the numerator, eight sequential executions of the instruction are needed to fully perform the complete division. The output of a first instruction is written to the register file and used as the input for the next subsequent instruction.

Referring to FIG. 5a, the numerator input is provided at the B input and the divisor is presented at the D input. Again, in the present embodiment, both of the B and D input operands are 16 bits. A “packed” 32 bit data structure “PACK” that is a concatenation A, B of the A and B input operands (the A operand is also 16 bits) can be viewed as an initial data structure of a complete division process. As an initial condition A is set to a string of sixteen zeroes (000 . . . 0) and B is the numerator value.

Referring to FIGS. 5a and 5b, during a first micro-sequence, a left shift of the PACK data structure is performed to create a data structure A[14:0], B[15], referred to as the most significant word of PACK (“PACK msw”). The divisor D is then subtracted 511 from PACK msw by the second ALU logic unit 502. This operation corresponds to long division where the divisor is initially divided into the leading digit of the numerator. Note that in an embodiment, the ALU logic units 501, 502 are actually three input ALUs and not two input ALUs as suggested by FIG. 4 (the third input is reserved for the divisor D for the iterative divide operation).

Different data processing procedures are then followed depending on the sign 512 of the result of the subtraction 511. Importantly, the first quotient resultant bit (i.e., the first bit of the division result) is staged to be written into the second to least significant bit of the Y output port 509 (“NEXT B[1]”). If the result of the subtraction is negative, the quotient resultant bit B[1] is set 513 to a 0. If the result of the subtraction is positive, the quotient resultant bit B[1] is set 514 to a 1. The setting of this bit corresponds to the process in long division where the first digit of the quotient result is determined by establishing whether or not the divisor value can be divided into the first digit of the numerator.

Additionally, two different data structures are crafted and presented to respective input ports (“1”, “2”) of a multiplexer 506 (which may be the same multiplexer as multiplexer 406 of FIG. 4). The first data structure corresponds to a left shift of PACK msw (A[13:0], B[15], B[14]) and is presented at input 1 of the multiplexer 506. The creation of this data structure corresponds to the process in long division where the next digit of the numerator is appended to its most significant neighbor if the divisor does not divide into the most significant neighbor.

The second crafted data structure corresponds to a left shift of the result of the subtraction 511 that was just performed by the second ALU logic unit 502 appended with bit B[13] and is presented at the second input (“2”) of the multiplexer 506. The creation of this data structure corresponds to the situation in long division where a divisor divides into the first digit(s) of the numerator which sets up a next division into the result of the difference between first digit(s) of the numerator and a multiple of the divisor.

The first or second data structures are then selected by the multiplexer 506 depending on whether the result of the subtraction performed by the second ALU logic unit 502 yielded a positive or negative result. If the subtraction yielded a negative result (which corresponds to the divisor not being able to be divided into the next significant digit of the numerator), the first data structure is selected 513. If the subtraction yielded a positive result (which corresponds to the divisor being able to be divided into the next significant digit of the numerator), the second data structure is selected 514.

The output of the multiplexer 506 is now understood to be the new most significant word of the PACK data structure (new PACK msw) and corresponds to the next value in a long division sequence into which the divisor is to be divided. As such, the first ALU logic unit 501 subtracts 515 the divisor D from the new PACK msw value. The least significant bit 510 of the Y output B[0] is staged to be written as a 1 or a 0 depending on the sign of the subtraction result from the first ALU 501 and represents the next digit in the quotient resultant 517, 518.

A second multiplexer 508 selects between first and second data structures depending 516 on the sign of the first ALU logic unit's subtraction 515. A first data structure, presented at input “1” of the second multiplexer 508, corresponds to the new PACK msw value. A second data structure, presented at input “2” of the second multiplexer 508, corresponds to the result of the subtraction performed by the first ALU logic unit 501. Which of the two data structures is selected depends on the sign of the result of the subtraction 515 performed by the first ALU 501. If the result of the subtraction is negative, the multiplexer selects the new PACK msw value 517. If the result of the subtraction is positive, the multiplexer selects the new PACK msw-D value 518.

The output of the second multiplexer 508 corresponds to the NEXT A value which is written into the register file from the X output. The value presented at the Y output (B[15:0]) is composed, at its leading edge, of the B operand less its two most significant bits that were consumed by the two just performed iterations (B[13:0]). The concatenation of these remainder bits of B with the two newly calculated quotient digit resultants is written into the register file as the new B operand NEXT B. For a next iteration, the X output from the previous instruction is read into the A operand and the Y output from the previous instruction is read into the B operand. The process then repeats until all digits of the original B operand have been processed (which, again, in the case of a 16 bit B operand will consume eight sequential executions of the instruction). At the conclusion of all iterations, the final quotient will be written into the register file from the Y output and any remainder will be represented in the NEXT A value which is written into the register file from the X output.
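The two micro-sequences described above can be sketched in software as follows. This is a minimal behavioral model, not the disclosed circuit: the function names divide_step and divide16 are hypothetical, the A operand is assumed to start at zero, a subtraction result of zero is treated as the positive case, and intermediate values are held in unbounded Python integers rather than being truncated to register width. The sketch appends bit B[14] to both candidate values for the second micro-sequence, matching the first data structure (A[13:0], B[15], B[14]).

```python
def divide_step(a, b, d):
    """One execution of the iterative divide instruction.

    a, b are the A and B operands (PACK = A, B); d is the divisor D.
    Returns (next_a, next_b), i.e. the X and Y outputs.
    """
    # First micro-sequence (second ALU logic unit 502).
    msw = (a << 1) | ((b >> 15) & 1)      # PACK msw = A[14:0], B[15]
    diff = msw - d                        # subtraction 511
    b14 = (b >> 14) & 1
    if diff < 0:
        q1 = 0                            # NEXT B[1] set to 0
        new_msw = (msw << 1) | b14        # multiplexer 506, input "1"
    else:
        q1 = 1                            # NEXT B[1] set to 1
        new_msw = (diff << 1) | b14       # multiplexer 506, input "2"

    # Second micro-sequence (first ALU logic unit 501).
    diff2 = new_msw - d                   # subtraction 515
    if diff2 < 0:
        q0 = 0                            # NEXT B[0] set to 0
        next_a = new_msw                  # multiplexer 508, input "1"
    else:
        q0 = 1                            # NEXT B[0] set to 1
        next_a = diff2                    # multiplexer 508, input "2"

    # Y output: B[13:0] followed by the two new quotient bits.
    next_b = ((b & 0x3FFF) << 2) | (q1 << 1) | q0
    return next_a, next_b


def divide16(numerator, divisor):
    """Divide a 16 bit numerator by repeating the instruction eight times."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    a, b = 0, numerator & 0xFFFF          # assumption: A starts at zero
    for _ in range(8):                    # two quotient bits per execution
        a, b = divide_step(a, b, divisor)
    return b, a                           # (quotient, remainder)
```

Calling divide16(100, 7) runs eight executions of divide_step and returns the quotient from the final B value and the remainder from the final A value, mirroring the Y and X outputs described above.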

FIG. 6 shows an embodiment of a methodology performed by the ALU unit described above. As observed in FIG. 6, the method includes performing the following with an ALU unit of an image processor: executing a first instruction, the first instruction being a multiply add instruction 601; executing a second instruction including performing parallel ALU operations with first and second ALU logic units operating simultaneously to produce different respective output resultants of the second instruction 602; executing a third instruction including performing sequential ALU operations with one of the ALU logic units operating from an output of the other of the ALU logic units to determine an output resultant of the third instruction 603; and executing a fourth instruction including performing an iterative divide operation in which the first ALU logic unit and the second ALU logic unit operate to determine first and second division resultant digit values 604.
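The difference between the parallel operation 602 and the sequential operation 603 can be illustrated with a short behavioral sketch. The function names parallel_alu and sequential_alu, and the use of Python's operator module to stand in for ALU opcodes, are illustrative assumptions rather than part of the described hardware.

```python
import operator


def parallel_alu(op1, op2, a, b, c, d):
    """Both ALU logic units operate simultaneously on separate operand
    pairs and produce two different output resultants."""
    return op1(a, b), op2(c, d)


def sequential_alu(op1, op2, a, b, c):
    """One ALU logic unit operates on the output of the other, producing
    a single output resultant."""
    return op2(op1(a, b), c)
```

For example, parallel_alu(operator.add, operator.sub, 1, 2, 5, 3) yields two independent results, whereas sequential_alu(operator.add, operator.mul, 1, 2, 4) feeds the sum into the multiplication and yields one result.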

It is pertinent to point out that the various image processor architecture features described above are not necessarily limited to image processing in the traditional sense and therefore may be applied to other applications that may (or may not) cause the image processor to be re-characterized. For example, if any of the various image processor architecture features described above were to be used in the creation and/or generation and/or rendering of animation as opposed to the processing of actual camera images, the image processor may be characterized as a graphics processing unit. Additionally, the image processor architectural features described above may be applied to other technical applications such as video processing, vision processing, image recognition and/or machine learning. Applied in this manner, the image processor may be integrated with (e.g., as a co-processor to) a more general purpose processor (e.g., that is or is part of a CPU of a computing system), or may be a stand alone processor within a computing system.

The hardware design embodiments discussed above may be embodied within a semiconductor chip and/or as a description of a circuit design for eventual targeting toward a semiconductor manufacturing process. In the case of the latter, such circuit descriptions may take the form of a (e.g., VHDL or Verilog) register transfer level (RTL) circuit description, a gate level circuit description, a transistor level circuit description or mask description or various combinations thereof. Circuit descriptions are typically embodied on a computer readable storage medium (such as a CD-ROM or other type of storage technology).

From the preceding sections it is pertinent to recognize that an image processor as described above may be embodied in hardware on a computer system (e.g., as part of a handheld device's System on Chip (SOC) that processes data from the handheld device's camera). In cases where the image processor is embodied as a hardware circuit, note that the image data that is processed by the image processor may be received directly from a camera. Here, the image processor may be part of a discrete camera, or part of a computing system having an integrated camera. In the case of the latter, the image data may be received directly from the camera or from the computing system's system memory (e.g., the camera sends its image data to system memory rather than the image processor). Note also that many of the features described in the preceding sections may be applicable to a graphics processor unit (which renders animation).

FIG. 7 provides an exemplary depiction of a computing system. Many of the components of the computing system described below are applicable to a computing system having an integrated camera and associated image processor (e.g., a handheld device such as a smartphone or tablet computer). Those of ordinary skill will be able to easily delineate between the two.

As observed in FIG. 7, the basic computing system may include a central processing unit 701 (which may include, e.g., a plurality of general purpose processing cores 715_1 through 715_N and a main memory controller 717 disposed on a multi-core processor or applications processor), system memory 702, a display 703 (e.g., touchscreen, flat-panel), a local wired point-to-point link (e.g., USB) interface 704, various network I/O functions 705 (such as an Ethernet interface and/or cellular modem subsystem), a wireless local area network (e.g., WiFi) interface 706, a wireless point-to-point link (e.g., Bluetooth) interface 707 and a Global Positioning System interface 708, various sensors 709_1 through 709_N, one or more cameras 710, a battery 711, a power management control unit 712, a speaker and microphone 713 and an audio coder/decoder 714.

An applications processor or multi-core processor 750 may include one or more general purpose processing cores 715 within its CPU 701, one or more graphics processing units 716, a memory management function 717 (e.g., a memory controller), an I/O control function 718 and an image processing unit 719. The general purpose processing cores 715 typically execute the operating system and application software of the computing system. The graphics processing units 716 typically execute graphics intensive functions to, e.g., generate graphics information that is presented on the display 703. The memory management function 717 interfaces with the system memory 702 to write/read data to/from system memory 702. The power management control unit 712 generally controls the power consumption of the system 700.

The image processing unit 719 may be implemented according to any of the image processing unit embodiments described at length above in the preceding sections. Alternatively or in combination, the IPU 719 may be coupled to either or both of the GPU 716 and CPU 701 as a co-processor thereof. Additionally, in various embodiments, the GPU 716 may be implemented with any of the image processor features described at length above.

The touchscreen display 703, the communication interfaces 704-707, the GPS interface 708, the sensors 709, the camera 710, and the speaker/microphone codec 713, 714 can all be viewed as various forms of I/O (input and/or output) relative to the overall computing system including, where appropriate, an integrated peripheral device as well (e.g., the one or more cameras 710). Depending on implementation, various ones of these I/O components may be integrated on the applications processor/multi-core processor 750 or may be located off the die or outside the package of the applications processor/multi-core processor 750.

In an embodiment, one or more cameras 710 includes a depth camera capable of measuring depth between the camera and an object in its field of view. Application software, operating system software, device driver software and/or firmware executing on a general purpose CPU core (or other functional block having an instruction execution pipeline to execute program code) of an applications processor or other processor may perform any of the functions described above.

Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific hardware components that contain hardwired logic for performing the processes, or by any combination of programmed computer components and custom hardware components.

Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other type of media/machine-readable medium suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).

In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

1.-19. (canceled)
20. An image processor comprising an array of processing units, wherein each processing unit of the array of processing units comprises: four input ports and two output ports; and a first arithmetic-logic unit (ALU) and a second ALU configured to perform a double-width ALU operation, during which: the first ALU is configured to receive data from a first pair of input ports, to perform a first full-width ALU operation to compute (i) a lower half result of the double-width ALU operation and (ii) a carry term, to provide the lower half result of the double-width ALU operation to one of the two output ports, and to provide the carry term to the second ALU, and the second ALU is configured to receive data from a second pair of input ports and receive the carry term from the first ALU, to perform a second full-width ALU operation to compute an upper half result of the double-width ALU operation, and to provide the upper half result of the double-width ALU operation to another of the two output ports.
21. The image processor of claim 20, wherein each processing unit is configured to perform the second full-width ALU operation after the first full-width ALU operation is complete.
22. The image processor of claim 21, wherein each processing unit has a carry line between the first ALU and the second ALU to provide the carry term to the second ALU.
23. The image processor of claim 22, wherein the second ALU is configured to perform the second full-width ALU operation only upon receiving the carry term on the carry line.
24. The image processor of claim 20, wherein the first ALU and the second ALU of each processing unit are further configured to perform four half-width ALU operations at least partially in parallel, during which: the first ALU and the second ALU are each configured to receive input operands from a respective pair of input ports, to perform a first half-width operation on a lower half of each of the input operands, to perform a second half-width operation on an upper half of each of the input operands, and to write a result to a respective one of the two output ports.
25. The image processor of claim 20, wherein the first ALU and the second ALU of each processing unit are further configured to perform a fused operation comprising a second operation performed serially on the result of a first operation, during which: the first ALU is configured to receive data from the first pair of input ports, to perform the first operation, and to provide a result of the first operation to the second ALU; and the second ALU is configured to receive data from one input port of the second pair of input ports and to receive the result of the first operation from the first ALU, to perform the second operation, and to provide a result of the second operation to one of the two output ports.
26. The image processor of claim 25, wherein the first operation and the second operation are different.
27. A method implemented by a processing unit of an image processor comprising an array of processing units, the method comprising: performing, by a first arithmetic-logic unit (ALU) and a second ALU of the processing unit, a double-width ALU operation using data received at a first pair of input ports and a second pair of input ports of the processing unit, including: receiving, by the first ALU, data from the first pair of input ports of the processing unit, performing, by the first ALU, a first full-width ALU operation using the data from the first pair of input ports to compute a lower half result of the double-width ALU operation and a carry term, providing, by the first ALU, the lower half result of the double-width ALU operation to one of two output ports of the processing unit, providing, by the first ALU, the carry term to the second ALU, receiving, by the second ALU, data from the second pair of input ports of the processing unit, receiving, by the second ALU, the carry term from the first ALU, performing, by the second ALU, a second full-width ALU operation using the data from the second pair of input ports and the carry term to compute an upper half result of the double-width ALU operation, and providing, by the second ALU, the upper half result of the double-width ALU operation to another of the two output ports.
28. The method of claim 27, wherein performing the second full-width ALU operation comprises performing the second full-width ALU operation after the first full-width ALU operation is complete.
29. The method of claim 28, wherein each processing unit has a carry line between the first ALU and the second ALU to provide the carry term to the second ALU.
30. The method of claim 29, wherein performing the second full-width ALU operation comprises performing the second full-width ALU operation only upon receiving the carry term on the carry line.
31. The method of claim 27, further comprising: performing, by the first ALU and the second ALU, four half-width ALU operations at least partially in parallel, including: receiving, by the first ALU and the second ALU, respective input operands from a respective pair of input ports, performing, by the first ALU and the second ALU, a first half-width operation on a lower half of each of the input operands, performing, by the first ALU and the second ALU, a second half-width operation on an upper half of each of the input operands, and writing, by the first ALU and the second ALU, a result to a respective one of the two output ports.
32. The method of claim 27, further comprising: performing, by the first ALU and the second ALU, a fused operation comprising a second operation performed serially on the result of a first operation, including: receiving, by the first ALU, data from the first pair of input ports, performing, by the first ALU, the first operation, and providing, by the first ALU, a result of the first operation to the second ALU, receiving, by the second ALU, data from one input port of the second pair of input ports, receiving, by the second ALU, the result of the first operation from the first ALU, performing, by the second ALU, the second operation, and providing, by the second ALU, a result of the second operation to one of the two output ports.
33. The method of claim 32, wherein the first operation and the second operation are different.
34. An image processor comprising an array of processing units, wherein each processing unit of the array of processing units is configured to perform a double-width ALU operation, wherein each processing unit comprises: four input ports and two output ports; and means for performing a first full-width ALU operation using data received at a first pair of the input ports to write a lower half result of the double-width ALU operation to a first output port and to generate a carry term; and means for performing a second full-width ALU operation using the carry term and data received at a second pair of the input ports and to write an upper half result of the double-width ALU operation to a second output port.
35. The image processor of claim 34, wherein each processing unit is configured to perform the second full-width ALU operation after the first full-width ALU operation is complete.
36. The image processor of claim 35, wherein each processing unit has a carry line between the means for performing the first full-width ALU operation and the means for performing the second full-width ALU operation.
37. The image processor of claim 36, wherein the means for performing the second full-width ALU operation is configured to perform the second full-width ALU operation only upon receiving the carry term on the carry line.